Exercises

Link to the source.

Solutions

Roman emperors

The first exercise uses a dataset about roman emperors from the tidytuesday project (link).

library(tidyverse)
library(lubridate)

fix_emperors <- function(data) {
  data %>% 
    mutate(
      birth = case_when(
        index %in% c(1, 2, 4, 6) ~ update(birth, year = -year(birth)),
        TRUE                     ~ birth
      ),
      reign_start = case_when(
        index == 1 ~ update(reign_start, year = -year(reign_start)),
        TRUE       ~ reign_start
      )
    )
}


emperors <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-08-13/emperors.csv") %>% 
  fix_emperors()

This is what our data looks like:

emperors

Let’s answer some questions! What was the most popular way to rise to power?

emperors %>% 
  count(rise, sort = TRUE)

Turns out, the best way to become an emperor is to be born an emperor! But there is still a decent chance to seize power by force.

Emperors live a dangerous life!

I what are the most common causes of death among roman emperors? What (or who) killed them?

emperors %>% 
  group_by(cause, killer) %>% 
  summarise(n = n()) %>% 
  arrange(desc(n))

While diseases as a natural cause still make up for a significant portion of deaths, they are quickly followed by less natural causes such as assassination. Mostly by other emperors, but there are also two wives among the killers.

There are of course multiple ways of graphing this. Here, we count the causes and killers and display them as a dodged barchart.

emperors %>% 
  count(cause, killer, sort = TRUE) %>% 
  ggplot(aes(cause, n, group = paste(cause, killer), fill = killer)) + 
  geom_col(position = position_dodge2(preserve = "single", width = 1.1))

geom_bar can do the counting for us and we switched to a stacked bar chart and a different color palette:

emperors %>% 
  ggplot(aes(cause, group = paste(cause, killer), fill = killer)) + 
  geom_bar(color = "black") +
  scale_fill_viridis_d()

This allows us to compare the main causes as well as make comparisons within each cause.

Which dynasty was the most successful? Firstly, how often did each dynasty reign?

emperors %>% 
  count(dynasty, sort = TRUE)

A simple count may be enough here, but we like our pretty graphs. So we play around with a barplot again.

emperors %>% 
  count(dynasty) %>% 
  mutate(dynasty = fct_reorder(dynasty, n)) %>% 
  ggplot(aes(dynasty, n)) +
  geom_col() +
  geom_text(aes(label = n), color = "white", vjust = 1.3, fontface = "bold")

That’s a lot of Gordians!

But how long where the reigns?

We found out that there are two different ways of calculating the reign length and that those are not equivalent.

If we calculate how long each emperor reigned and then sum those up within each dynasty:

emperors %>% 
  mutate(reign = reign_end - reign_start) %>% 
  group_by(dynasty) %>% 
  summarise(reign = sum(reign)) %>% 
  arrange(desc(reign))

We actually get higher numbers than by subtracting the reign start of the first emperor in each dynasty from the reign end of the last emperor in the dynasty!

emperors %>% 
  arrange(reign_start) %>% 
  group_by(dynasty) %>% 
  summarise(reign = last(reign_end) - first(reign_start)) %>% 
  arrange(desc(reign))

The reason for this is that some emperors started their reign with the previous emperor also still reigning:

emperors %>% 
  arrange(reign_start) %>% 
  mutate(time_between = time_length(reign_start - lag(reign_end), unit = "year") ) %>% 
  ggplot(aes(time_between)) +
  geom_histogram() +
  labs(x = "Years after the previous emperor") +
  theme_minimal()

In some cases this is probably a data quality issue, in some cases it might actually be the case that an emperor decided to now be reigning while another was still on the throne.

emperors %>% 
  mutate(name = fct_reorder(name, reign_start)) %>% 
  ggplot(aes(x = reign_start, xend = reign_end, y = name, yend = name, color = dynasty)) +
  geom_segment(size = 2) +
  # geom_text(aes(label = name)) +
  scale_color_viridis_d() +
  theme_minimal() +
  theme(legend.position = "bottom") +
  labs(x = "Time", y = "", title = "Roman Emperors") +
  scale_y_discrete()

There was a question about coloring in the names on the y-axis, here is a package to help you with that: https://wilkelab.org/ggtext/

Which dynasty would you rather be a part of, if your goal is to live the longest?

plt <- emperors %>% 
  mutate(life = death - birth) %>% 
  filter(!is.na(life)) %>% 
  ggplot(aes(dynasty, life)) +
  stat_summary(color = "grey50") +
  geom_point(aes(text = name)) +
  coord_flip() +
  theme_minimal()

plotly::ggplotly(plt)

As always, the answer depends. Are we the gambling type? Gordians have pretty high ceiling for their lifespan but also a pretty low floor…

Dairy Products in the US

Another dataset (link) concerns dairy product consumption per person in the US across a number of years.

dairy <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-01-29/milk_products_facts.csv")

Note: after playing a around with the data we would normally then put all the data cleaning steps up here where we import the data so that we see it straight away. Sometimes I like to call the initially imported dataset <name>_raw, e.g. dairy_raw and then after cleaning the data I store it in a variable like dairy_tidy or dairy.

This is whay the initial data looks like:

dairy

All masses are given in lbs (pounds), can you convert them to kg?

There are multiple ways to do that!

lbs_to_kg <- function(x) x / 2.2

dairy %>% 
  mutate(across(-year, lbs_to_kg))

or with an anonymous function:

dairy %>% 
  mutate(across(-year, ~ .x / 2.2))

But we can also remember our tidy data principles and do a pivot first!

dairy <- dairy %>% 
  pivot_longer(-year, names_to = "product", values_to = "consumption") %>% 
  mutate(consumption = consumption / 2.2)

All approaches are valid, but we like the tidy data format for the rest of this analysis. Note though, that the first approach had the added benefit of giving a reasonable name to our calculation lbs_to_kg, so this will be nice if we have to come back to it later to see what we did. Readability counts! We could have combined the first and third approach as well.

Which products lost their customer base over time, which ones won? Which products have the greatest absolute change in production when estimated with a straight line?

Visually:

dairy %>% 
  ggplot(aes(year, consumption, color = product)) +
  geom_smooth(method = "lm", alpha = 0.3) +
  geom_line() +
  facet_wrap(~product, scales = "free") +
  guides(color = "none") +
  theme_minimal() +
  labs(y = "Consumption [kg]") +
  scale_x_continuous(n.breaks = 5)

Numerically:

models <- dairy %>%
  group_by(product) %>% 
  summarise(
    model = list(lm(consumption ~ year)),
    slope = map_dbl(model, ~ broom::tidy(.x)$estimate[2])
  )

models %>% 
  arrange(abs(slope))

Further notes

Tidy evaluatio

By now I have hyped up writing your own functions quite a bit and there will inevitably come the point when you will try to write a function that works like one of the tidyverse functions.

Unfortunately the trade off of being nice to work with interactively, like just referring to columns in a dataframe by their name inside of tidyverse functions, comes at a price. If we want the same, we need to take on extra step.

In the function test below, if we didn’t have the double curly braces {{}}, dplyr would search for a column named column in the dataset! This is of course not there, so you get an error. The curly-curly operator allows us to tell dplyr that this name is just a placeholder for what the user of our function will supply:

test <- function(data, column) {
  data %>% 
    count( {{ column }} )
}

test(emperors, rise)

Finding help

Learning how to find help and ask for help is crucial. The most important tip is learning how a “Reproducible Example”, a “reprex” works

https://www.tidyverse.org/help/#reprex

More Resources.

---
title: "Solution 4"
author: "Jannik Buhr"
date: "12/11/2021"
output: 
  html_document: 
    code_download: true
    df_print: paged
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
```

# Exercises

[Link](https://jmbuhr.de/dataintro/functional-programming.html) to the source.

# Solutions

## Roman emperors

The first exercise uses a dataset about roman emperors
from the tidytuesday project
([link](https://github.com/rfordatascience/tidytuesday/tree/master/data/2019/2019-08-13)).

```{r}
library(tidyverse)
library(lubridate)

fix_emperors <- function(data) {
  data %>% 
    mutate(
      birth = case_when(
        index %in% c(1, 2, 4, 6) ~ update(birth, year = -year(birth)),
        TRUE                     ~ birth
      ),
      reign_start = case_when(
        index == 1 ~ update(reign_start, year = -year(reign_start)),
        TRUE       ~ reign_start
      )
    )
}


emperors <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-08-13/emperors.csv") %>% 
  fix_emperors()
```

This is what our data looks like:

```{r}
emperors
```

Let's answer some questions!
What was the most popular way to rise to power?

```{r}
emperors %>% 
  count(rise, sort = TRUE)
```

Turns out, the best way to become an emperor is to be born an emperor!
But there is still a decent chance to seize power by force.

> Emperors live a dangerous life!

I what are the most common causes of death among roman
emperors? What (or who) killed them?

```{r}
emperors %>% 
  group_by(cause, killer) %>% 
  summarise(n = n()) %>% 
  arrange(desc(n))
```

While diseases as a natural cause still make up for a significant
portion of deaths, they are quickly followed by less natural causes
such as assassination. Mostly by other emperors, but there are also
two wives among the killers.

There are of course multiple ways of graphing this.
Here, we count the causes and killers and display them as a dodged
barchart.

```{r}
emperors %>% 
  count(cause, killer, sort = TRUE) %>% 
  ggplot(aes(cause, n, group = paste(cause, killer), fill = killer)) + 
  geom_col(position = position_dodge2(preserve = "single", width = 1.1))
```

`geom_bar` can do the counting for us and we switched to a stacked bar chart
and a different color palette:

```{r}
emperors %>% 
  ggplot(aes(cause, group = paste(cause, killer), fill = killer)) + 
  geom_bar(color = "black") +
  scale_fill_viridis_d()
```

This allows us to compare the main causes as well as make comparisons
within each cause.

 
Which dynasty was the most successful?
Firstly, how often did each dynasty reign?

```{r}
emperors %>% 
  count(dynasty, sort = TRUE)
```

A simple count may be enough here, but we like our pretty graphs.
So we play around with a barplot again.

```{r}
emperors %>% 
  count(dynasty) %>% 
  mutate(dynasty = fct_reorder(dynasty, n)) %>% 
  ggplot(aes(dynasty, n)) +
  geom_col() +
  geom_text(aes(label = n), color = "white", vjust = 1.3, fontface = "bold")
```

That's a lot of Gordians!
 
But how long where the reigns?

We found out that there are two different ways of calculating
the reign length and that those are **not** equivalent.

If we calculate how long each emperor reigned and then sum those
up within each dynasty:

```{r}
emperors %>% 
  mutate(reign = reign_end - reign_start) %>% 
  group_by(dynasty) %>% 
  summarise(reign = sum(reign)) %>% 
  arrange(desc(reign))
```

We actually get higher numbers than by
subtracting the reign start of the first emperor in each
dynasty from the reign end of the last emperor in the dynasty!

```{r}
emperors %>% 
  arrange(reign_start) %>% 
  group_by(dynasty) %>% 
  summarise(reign = last(reign_end) - first(reign_start)) %>% 
  arrange(desc(reign))
```

The reason for this is that some emperors started their
reign with the previous emperor also still reigning:

```{r}
emperors %>% 
  arrange(reign_start) %>% 
  mutate(time_between = time_length(reign_start - lag(reign_end), unit = "year") ) %>% 
  ggplot(aes(time_between)) +
  geom_histogram() +
  labs(x = "Years after the previous emperor") +
  theme_minimal()
```

In some cases this is probably a data quality issue, in some cases
it might actually be the case that an emperor decided to now
be reigning while another was still on the throne.


```{r, fig.height=12}
emperors %>% 
  mutate(name = fct_reorder(name, reign_start)) %>% 
  ggplot(aes(x = reign_start, xend = reign_end, y = name, yend = name, color = dynasty)) +
  geom_segment(size = 2) +
  # geom_text(aes(label = name)) +
  scale_color_viridis_d() +
  theme_minimal() +
  theme(legend.position = "bottom") +
  labs(x = "Time", y = "", title = "Roman Emperors") +
  scale_y_discrete()
```

There was a question about coloring in the names on the y-axis,
here is a package to help you with that: <https://wilkelab.org/ggtext/>

Which dynasty would you rather be a part of,
if your goal is to live the longest?
 
```{r}
plt <- emperors %>% 
  mutate(life = death - birth) %>% 
  filter(!is.na(life)) %>% 
  ggplot(aes(dynasty, life)) +
  stat_summary(color = "grey50") +
  geom_point(aes(text = name)) +
  coord_flip() +
  theme_minimal()

plotly::ggplotly(plt)
```

As always, the answer depends.
Are we the gambling type?
Gordians have pretty high ceiling for their lifespan but also a pretty low floor...

## Dairy Products in the US

Another dataset
([link](https://github.com/rfordatascience/tidytuesday/tree/master/data/2019/2019-01-29#milk_products_facts))
concerns dairy product consumption per person in the US across a number of years.

```{r}
dairy <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-01-29/milk_products_facts.csv")
```

Note: after playing a around with the data we would normally then put
all the data cleaning steps up here where we import the data so that
we see it straight away.
Sometimes I like to call the initially imported dataset `<name>_raw`,
e.g. `dairy_raw` and then after cleaning the data I store it
in a variable like `dairy_tidy` or `dairy`.

This is whay the initial data looks like:

```{r}
dairy
```


All masses are given in lbs (pounds), can you convert them to kg?

There are multiple ways to do that!

```{r, eval=FALSE}
lbs_to_kg <- function(x) x / 2.2

dairy %>% 
  mutate(across(-year, lbs_to_kg))
```

or with an anonymous function:

```{r, eval=FALSE}
dairy %>% 
  mutate(across(-year, ~ .x / 2.2))
```

But we can also remember our tidy data principles and do a pivot first!

```{r}
dairy <- dairy %>% 
  pivot_longer(-year, names_to = "product", values_to = "consumption") %>% 
  mutate(consumption = consumption / 2.2)
```

All approaches are valid, but we like the tidy data format for the
rest of this analysis.
Note though, that the first approach had the added benefit
of giving a reasonable name to our calculation `lbs_to_kg`,
so this will be nice if we have to come back to it later to see
what we did. Readability counts!
We could have combined the first and third approach as well. 

Which products lost their customer base over time,
which ones won? Which products have the greatest absolute
change in production when estimated with a straight line?

Visually:

```{r fig.width=12}
dairy %>% 
  ggplot(aes(year, consumption, color = product)) +
  geom_smooth(method = "lm", alpha = 0.3) +
  geom_line() +
  facet_wrap(~product, scales = "free") +
  guides(color = "none") +
  theme_minimal() +
  labs(y = "Consumption [kg]") +
  scale_x_continuous(n.breaks = 5)
```

Numerically:

```{r}
models <- dairy %>%
  group_by(product) %>% 
  summarise(
    model = list(lm(consumption ~ year)),
    slope = map_dbl(model, ~ broom::tidy(.x)$estimate[2])
  )

models %>% 
  arrange(abs(slope))
```


# Further notes

## Tidy evaluatio

By now I have hyped up writing your own functions quite a bit
and there will inevitably come the point when you will try
to write a function that works like one of the tidyverse functions.

Unfortunately the trade off of being nice to work with interactively,
like just referring to columns in a dataframe by their name inside
of tidyverse functions, comes at a price.
If we want the same, we need to take on extra step.

In the function `test` below, if we didn't have the
double curly braces `{{}}`, dplyr would search for a column named
`column` in the dataset!
This is of course not there, so you get an error.
The curly-curly operator allows us to tell dplyr that this name
is just a placeholder for what the user of our function will supply:

```{r}
test <- function(data, column) {
  data %>% 
    count( {{ column }} )
}

test(emperors, rise)
```

## Finding help

Learning how to find help and ask for help is crucial.
The most important tip is learning how a "Reproducible Example",
a "reprex" works

<https://www.tidyverse.org/help/#reprex>

More [Resources](https://jmbuhr.de/dataintro/resources-5.html#getting-help-1).

